This practice is part of the subject Biomedical Data Science of the Degree in Data Science of the Universitat Politècnica de València, and taught by the Department of Applied Physics.
The measurement of data quality dimensions (DQ) is the central axis for the evaluation and improvement of data quality as well as for its correct and optimal use. Data quality dimensions are individual aspects or constructs that represent data quality attributes. To these can be associated one or more metrics, quantified with specific methods, as well as exploratory methods.
This practice is intended to provide an initial basis for the evaluation of DQ metrics. It will consist of the application of a series of methods for different dimensions of DQ. In the context of the maternal and child clinical setting, we will analyze a data file whose ultimate purpose is the monitoring of care indicators in this setting. Depending on the dimension, we will apply the methods and calculate the metrics both in general for the whole dataset and monitored over batches of time (by months), simulating the results of a DQ monitoring and continuous improvement system.
In some parts of the code we will find the text ##TODO## that we will need to complete. Additionally, we will have to discuss the results in those points where it is indicated. The deliverable of the practice will consist of the compilation in html of this R markdown file, using Knit, where the results of the execution and figures will be observed, and having completed the ##TODO## and commented the results.
We check that the working directory is in the one where we have the practice file and the folder with the data:
getwd()
Otherwise, we set it (change the example directory to ours):
setwd(".")
We install the required libraries and then load them into the working environment.
# install.packages("zoo", repos = "http://cran.us.r-project.org")
# install.packages("rts", repos = "http://cran.us.r-project.org")
# install.packages("plotly", repos = "http://cran.us.r-project.org")
# install.packages("devtools", repos = "http://cran.us.r-project.org")
# library("devtools")
# devtools::install_github('c5sire/datacheck')
# devtools::install_github("hms-dbmi/EHRtemporalVariability")
library("shiny")
library("zoo")
library("rts")
library("plotly")
library("datacheck")
library("EHRtemporalVariability")
We set the initial parameters of the data. The main date of the records, which will be used for the purpose of monitoring the delivery care indicators, is the date of birth.
# File name
fileName = "data/DQIinfantFeeding.csv"
# Whether it has a header or not
hasHeader = TRUE
# Main date column to be used for monitoring purposes
dateColumn = "BIRTH_DATE"
# Format of the previous date
dateFormat = '%d/%m/%Y'
# Which text string will represent missing data
missingValue = NaN
We load the file data/DQIinfantFeeding.csv in a data.frame named repository:
repository <- read.csv2(fileName, header=hasHeader, na.strings=missingValue)
# We collect the number of rows and columns
N <- nrow(repository)
D <- ncol(repository)
For monitoring purposes, we will use the zoo library (S3 Infrastructure for Regular and Irregular Time Series - Z’s Ordered Observations) to convert the data, the data.frame, to a format suited for batch analyses, the zoo format.
zooRepository <- read.zoo(repository,format = dateFormat,index.column = dateColumn)
One of the main uses of the maternal and infant data repository studied is the monitoring of quality of care indicators. In the field of newborn feeding, one of the most important indicators is whether there has been early initiation of breastfeeding in the delivery room. To calculate this indicator, we create the following function that will obtain the indicator for each batch of data received, so that we can apply it repeatedly for each batch given a frequency.
indicatorEBF_delroom <- function(dataset){
numerator = (dataset$NEO_MOMFBF_TYPE %in% 'During the 1st hour') &
(dataset$NEO_PFBF_TYPE %in% 'Delivery room') &
(dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))
denominator = (dataset$NEO_MOMFBF_TYPE %in% c('During the 1st hour', 'During the 2nd hour', 'During the 3rd hour','Breastfeeding does not start')) &
(dataset$NEO_PFBF_TYPE %in% c('Delivery room', 'Hospitalization unit', 'Breastfeeding does not start')) &
!(dataset$NEO_FBFEVAL_TY %in% 'Undesired breastfeeding') &
(dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))
indicator = sum(numerator)/sum(denominator) * 100
return(indicator)
}
Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts (Raster Time Series Analysis) library. In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:
resIndicatorES2SC_delroom =apply.monthly(zooRepository, FUN=indicatorEBF_delroom)
plot(resIndicatorES2SC_delroom,xlab = "Date", ylab ="%",main = "Early breastfeeding start in the delivery room", ylim=c(0,100))
📝 DISCUSS RESULTS
…
We will find the missing data in the repository and calculate the corresponding metrics. First, for each variable:
NAmatrix <- !is.na(repository)
sumNAmatrix <- apply(NAmatrix,2,sum)
completenessByColumn <- round(sumNAmatrix/N*100,2)
completenessByColumn
## PATIENT_ID DEL_TYPE_A DEL_MODE DEL_TYPE_ANE DEL_IN_INTRACA
## 100.00 100.00 100.00 100.00 100.00
## BIRTH_DATE NEO_ES2SC_TYPE NEO_ES2SCP_TYPE NEO_MOMFBF_TYPE NEO_PFBF_TYPE
## 100.00 100.00 100.00 100.00 100.00
## NEO_FBFEVAL_TY NEO_APGAR_NUM NEO_WEIGHT_NUM NEO_PHCORD_TXT NEO_BF_DIS
## 100.00 31.31 95.42 100.00 100.00
Next, we will calculate and display the overall percentage of missing data:
totalNonMissing <- sum(NAmatrix)
totalValues <- N * D
completenessByDataset <- round((totalNonMissing / totalValues) * 100, 2)
completenessByDataset
## [1] 95.12
📝 DISCUSS RESULTS
Los datos son los los faltantes sino para los que sí que hay valores.
To monitor the completeness by temporary batches we will create a function that does this calculation for each batch it receives as a parameter, and returns the completeness for each column, the function dimCompletessByColumn:
dimCompletenessByColumn <- function(repository){
N = dim(repository)[1]
NAmatrix <- !is.na(repository)
sumNAmatrix <- apply(NAmatrix,2,sum)
completenessByColumn <- round(sumNAmatrix/N*100,2)
return(completenessByColumn)
}
Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts (Raster Time Series Analysis) library. In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:
resCompletenessByColumn = apply.monthly(zooRepository, dimCompletenessByColumn)
Now, we can create a plot with the results using the plotly library (Create Interactive Web Graphics via ‘plotly.js’). First for each variable:
p <-
plot_ly(
x = index(resCompletenessByColumn),
y = resCompletenessByColumn[, 1],
name = names(resCompletenessByColumn)[1],
type = 'scatter',
mode = 'lines'
) %>%
plotly::layout(
title = 'Completeness by month',
xaxis = list(title = "Date"),
yaxis = list (title = "Completeness (%)")
)
for (i in 2:ncol(resCompletenessByColumn)) {
p = p %>% plotly::add_trace(y = resCompletenessByColumn[, i],
name = names(resCompletenessByColumn)[i],
mode = 'lines')
}
div(p, align = "center")
And secondly globally, being able to calculate the result from the variable resCompletenessByColumn and use the code to plot a single series from the previous code chunk:
globalCompletenessByMonth <- rowMeans(resCompletenessByColumn, na.rm = TRUE)
p <- plot_ly(
x = index(resCompletenessByColumn),
y = globalCompletenessByMonth,
name = "Global Completeness",
type = 'scatter',
mode = 'lines'
) %>%
plotly::layout(
title = 'Global Completeness by Month',
xaxis = list(title = "Date"),
yaxis = list(title = "Completeness (%)")
)
div(p, align = "center")
📝 DISCUSS RESULTS
…
We are going to analyze two multivariate consistency rules in the data. For this we will use the datacheck library (Tools for Checking Data Consistency), which allows us to write logical rules in R language, using the variable names of our data. These rules will be written in the attached file rules.R, the first one is provided as an example, and the second one should be coded based on the provided natural language expression rule.
# We read the rules file
rules = read_rules("rules.R")
# We evaluate the rules on our data
profile = datadict_profile(repository, rules)
# We show the account of errors for each rule
knitr::kable(profile$checks[,c(1,3,6)])
| Variable | Rule | Error.sum |
|---|---|---|
| DEL_MODE | !(sapply(DEL_MODE, tolower) %in% “elective caesarea”) | (sapply(DEL_TYPE_ANE, tolower) %in% c(“epidural anesthesia”, “epidural anesthesia / general anesthesia”, “spinal anesthesia”, “spinal anesthesia / general anesthesia”, “general anesthesia”)) | 220 |
| DEL_MODE | !(sapply(DEL_MODE, tolower) %in% “emergency caesarea”) | (sapply(DEL_TYPE_ANE, tolower) %in% c(“spinal anesthesia”, “spinal anesthesia / general anesthesia”, “epidural anesthesia / general anesthesia”, “general anesthesia”)) | 245 |
# We list the cases that have been marked as inconsistent for each rule
library(kableExtra)
knitr::kable(profile$checks[, c(1,7)]) %>%
kable_styling() %>%
scroll_box(width = "100%")
| Variable | Error.list |
|---|---|
| DEL_MODE | 472,500,525,583,636,638,654,683,698,699,729,793,802,811,821,838,846,865,866,890,918,969,975,985,987,1008,1010,1013,1066,1076,1082,1129,1150,1190,1194,1241,1246,1270,1300,1367,1383,1384,1388,1402,1411,1428,1433,1445,1446,1453,1459,1501,1508,1531,1559,1575,1578,1612,1616,1635,1665,1667,1669,1674,1685,1692,1719,1753,1770,1774,1781,1784,1787,1790,1791,1800,1802,1836,1839,1840,1855,1857,1861,1874,1883,1902,1925,1928,1933,1939,1950,1963,1970,1972,1984,1987,2000,2001,2010,2049,2055,2058,2066,2102,2122,2134,2135,2145,2169,2251,2278,2292,2322,2334,2348,2368,2369,2386,2401,2409,2457,2474,2493,2518,2530,2550,2551,2563,2583,2591,2592,2623,2624,2632,2647,2676,2679,2680,2686,2706,2716,2732,2762,2767,2778,2822,2857,2865,2883,2901,2907,2920,2923,2934,2935,2943,2963,2965,2981,2985,3002,3006,3016,3018,3024,3029,3039,3073,3099,3106,3107,3119,3126,3158,3179,3183,3188,3201,3216,3221,3222,3240,3254,3256,3259,3323,3344,3356,3359,3363,3365,3380,3391,3410,3418,3427,3443,3484,3519,3531,3550,3553,3560,3575,3592,3603,3618,3622,3627,3630,3638,3648,3660,3670,3676,3677,3698,3704,3707,3761 |
| DEL_MODE | 58,60,66,157,165,167,182,224,251,256,270,301,322,323,347,350,375,382,392,395,428,463,473,484,492,495,503,508,512,521,524,526,534,538,539,543,552,556,572,576,585,592,597,601,604,607,616,617,618,633,635,640,641,643,652,658,660,663,669,677,680,684,688,689,696,702,707,710,711,712,714,717,733,740,742,748,759,761,763,767,768,770,776,779,782,791,795,797,804,814,828,836,839,842,844,847,857,874,878,882,888,892,901,903,910,911,914,917,919,925,926,930,931,935,941,946,961,971,977,998,1007,1017,1022,1023,1024,1025,1032,1033,1035,1048,1050,1051,1052,1053,1056,1060,1062,1067,1072,1078,1084,1086,1088,1095,1103,1110,1111,1124,1126,1131,1132,1133,1135,1137,1139,1140,1143,1144,1146,1148,1163,1171,1179,1184,1187,1188,1191,1197,1199,1210,1214,1216,1220,1224,1234,1237,1244,1250,1255,1257,1258,1268,1273,1278,1282,1283,1286,1291,1294,1295,1297,1299,1308,1321,1329,1337,1353,1355,1357,1360,1361,1369,1376,1378,1380,1382,1385,1395,1400,1412,1413,1414,1417,1421,1426,1434,1438,1440,1442,1450,1456,1461,1464,1465,1472,1473,1476,1477,1479,1481,1488,1496,1504,1512,1519,1522,1537,1565,1568,1581,1583,1600,1604,1610,3149 |
📝 DISCUSS RESULTS
…
We are going to analyze if there are pattern changes in the data distributions over time. To do this we will use the EHRtemporalVariability library (Delineating Temporal Dataset Shifts in Electronic Health Records). First, we change to basic type Date the case date, originally in text format:
repositoryFormatted <- EHRtemporalVariability::formatDate(
input = repository,
dateColumn = "BIRTH_DATE",
dateFormat = dateFormat
)
We obtain the temporal maps from the already formatted repository using the function estimateDataTemporalMap and selecting a monthly period. We can get the help of the function by typing ?estimateDataTemporalMap in the console (it is only necessary to pass the data, the column with the date, and the desired period, the rest is obtained automatically from the data).
probMaps <- estimateDataTemporalMap(
data = repositoryFormatted,
dateColumn = "BIRTH_DATE",
period = "month"
)
Next we obtain the information geometry plots from the previously estimated temporal maps. To do this we will use the function estimateIGTProjection on the list of temporal maps. We are going to save the results in a variable.
igtProjs <- sapply( probMaps, estimateIGTProjection )
names( igtProjs ) <- names( probMaps )
We can observe as an example the data temporal map and information geometry temporal (IGT) plot of the variable type of anesthesia during delivery DEL_TYPE_ANE:
plotDataTemporalMap(probMaps$DEL_TYPE_ANE)
div(plotIGTProjection(igtProjs$DEL_TYPE_ANE, trajectory = TRUE), align = "center")
In this example, we can see that over time there are changes or differences associated with synonymy of terms (No anesthesia, no, Without anesthesia), or even differences between upper and lower case (spinal, Spinal).
Next, we are going to save the results of the temporal variability estimation in a file, so that we can upload them to the web application EHRtemporalVariability, in the Load your RData tab.
save(probMaps, igtProjs, file = "variabilityData.RData")
Either using the web application, or obtaining the graphs in RStudio, we will study the temporal evolution of the type of delivery variable DEL_MODE, and discuss the results and their potential implications in the problem defined at the beginning of the practice on the monitoring of the early onset of lactation indicator, as well as propose a possible solution.
📝 DISCUSS RESULTS
…
The maternal and infant data studied have been kindly provided by Ricardo García de León for educational purposes only. Their use or distribution outside this subject and practice is forbidden.